ABSTRACT

For many problems of interest in statistical pattern recognition, density
estimates for a random variable X of dimension d are unreliable unless the
number of sample vectors is very large ($> 2^d$). For even moderately large d
($d > 12$), sample sizes are often insufficient. However, lower order moments
of the form $x_i^{\alpha} x_j^{\beta}$ may be accurately estimated.

In this paper we are concerned with the problem of optimally discriminating
between two classes of random variables in terms of the available information
about them of reasonable accuracy (their lower order moments). In no case do
we make any assumption about the form of the probability densities of the
random variables X. (We do in some cases assume certain forms for the
densities of functions of these random variables, L(X).)

First we consider the performance of the Gaussian discriminant function for
arbitrary class distributions. How well this function discriminates depends
on the magnitudes of the interclass differences of the first four moments.
It is shown how to adjust the constant term of the Gaussian discriminant to
obtain minimum probability of error. Then a second order solution for the
optimal quadratic discriminant is given. Finally the methods developed are
applied to determine general discriminant functions.



CONTENTS

Abstract
I.  Introduction
II.  The Gaussian Discriminant
III.  The Minimal Variance Solution
IV.  The Quadratic Discriminant Function
V.  General Discriminant Functions
References


I.  INTRODUCTION 


Suppose two classes of events are mutually exclusive and exhaustive. Let
these classes be denoted by $\omega_1$ and $\omega_2$ and their respective
probabilities by $P(\omega_1)$ and $P(\omega_2)$. Suppose further that each
event has associated with it a vector X in $R^d$. This vector contains the
only observable information about the event. We then have a random variable
defined on each class. Denote the probability densities of these random
variables by $p_1(X)$ and $p_2(X)$. Let $M_1$, $M_2$, $\Sigma_1$ and
$\Sigma_2$ be the means and covariance matrices for these densities. Although
we may indeed not know the form of the densities $p_1(X)$ and $p_2(X)$, we
may estimate (or know) the first and second moments listed above as well as
various higher moments.

For many practical problems this is the case. A collection of sample X's of
class $\omega_1$ and sample X's of class $\omega_2$ is used to estimate the
densities $p_1(X)$, $p_2(X)$ and the moments. However, small sample sizes
($< 2^d$) prevent us from accurately estimating these densities. But moments
of the form $x_i^{\alpha} x_j^{\beta}$ may be reliably estimated from samples
of sizes which are not exponential in d. Hence in many actual situations,
where d is only moderately large ($d > 12$), the only reliable information
about the classes $\omega_1$ and $\omega_2$ consists of lower order moments.
Using this information, how do we best discriminate between $\omega_1$ and
$\omega_2$? This is exactly the mathematical problem we shall consider.

If we observe a certain vector X, how do we decide with which class it is
most likely associated? One method is the Gaussian discriminant defined as
follows: Assign an observed test vector, X, to class 1 or 2 according to the
magnitude of the following expression:

$$\tfrac{1}{2}(X-M_2)^T\Sigma_2^{-1}(X-M_2) - \tfrac{1}{2}(X-M_1)^T\Sigma_1^{-1}(X-M_1) + \tfrac{1}{2}\ln|\Sigma_2| - \tfrac{1}{2}\ln|\Sigma_1| \;\begin{cases} > t & \text{class 1} \\ < t & \text{class 2} \end{cases}$$

where the threshold

$$t = \ln P(\omega_2) - \ln P(\omega_1).$$

If $p_1(X)$ and $p_2(X)$ are normal densities this is equivalent to choosing
class 1 if $\dfrac{p_1(X)}{p_2(X)} > \dfrac{P(\omega_2)}{P(\omega_1)}$ and
class 2 otherwise. Clearly this procedure is optimal (minimizes the
probability of error) in the normal case.
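The decision rule above can be sketched numerically. The following is a minimal illustration, not part of the original report: the function names `gaussian_discriminant` and `classify` are hypothetical, and the threshold is taken as $\ln P(\omega_2) - \ln P(\omega_1)$, which is the value that makes the rule agree with the Bayes rule when both densities are normal.

```python
import numpy as np

def gaussian_discriminant(X, M1, S1, M2, S2):
    """Evaluate the Gaussian discriminant for each row of X (shape (n, d)).

    M1, M2 are class means; S1, S2 are class covariance matrices.
    """
    X = np.atleast_2d(X)
    d1 = X - M1
    d2 = X - M2
    # Quadratic forms (x - M)^T S^{-1} (x - M), one value per row.
    q1 = np.einsum("ni,ij,nj->n", d1, np.linalg.inv(S1), d1)
    q2 = np.einsum("ni,ij,nj->n", d2, np.linalg.inv(S2), d2)
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * q2 - 0.5 * q1 + 0.5 * logdet2 - 0.5 * logdet1

def classify(X, M1, S1, M2, S2, P1, P2):
    """Assign class 1 where the discriminant exceeds t, class 2 otherwise."""
    t = np.log(P2) - np.log(P1)  # assumed threshold convention
    return np.where(gaussian_discriminant(X, M1, S1, M2, S2) > t, 1, 2)
```

In practice the means and covariances would be replaced by sample estimates, which is exactly the situation the text describes: only the first and second moments are needed to evaluate the rule.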

Let us call the above discriminant function G(X). Although G may be the
optimal discriminant function only in the case that $p_1$ and $p_2$ are
indeed normal, it may be a good approximation to an optimal discriminant in
other cases. We shall first prove a result in this direction and then show
how to adjust the constant term of the discriminant G(X) to yield minimum
probability of error using third and fourth moments. We then give minimal
variance(1) solutions to the problem of finding the optimal quadratic
discriminant function. These solutions are of minimum error in several
asymptotic cases. Again the first four moments are used. Finally, general
discriminant functions are derived using marginal distributions or moments
of the form $x_i^{\alpha} x_j^{\beta}$. Upper bounds on the probability of
error of these discriminants are given in asymptotic cases using the central
limit theorem for finitely dependent random variables (see [3]).

(1) These solutions, introduced in Section III, are so named since they
involve the minimization of certain weighted sums of the variances of a
discriminant function under each hypothesis.

II.  THE GAUSSIAN DISCRIMINANT

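Theorem 1(*):

$$E_{p_1}(G) = \int_{R^d} G(X)\,p_1(X)\,dX \;\ge\; 0 \;\ge\; \int_{R^d} G(X)\,p_2(X)\,dX = E_{p_2}(G)$$

Proof: Suppose Y = AX is a non-singular linear mapping. For a probability
density $\delta$ we have $\tilde M = AM$ and $\tilde\Sigma = A\Sigma A^T$,
where $\tilde M$ and $\tilde\Sigma$ are the mean and covariance matrix for
the transformed density $\tilde\delta(Y) = |A|^{-1}\delta(A^{-1}Y)$. As on
page 34 of [1] let us determine A such that $A\Sigma_1A^T = I$ and
$A\Sigma_2A^T = \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_d)$, where
$\tilde p_i$ denotes the transformed density of $p_i$.

(*) This result was stated without proof in [2] for the univariate case.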


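Now, with $\lambda_i$ the variance of $y_i$ under $\tilde p_2$ (and unit
variance under $\tilde p_1$),

$$E_{p_1}(G(X)) = E_{\tilde p_1}(G(X(Y))) = E_{\tilde p_1}\Big[\tfrac12(Y-\tilde M_2)^T\Lambda^{-1}(Y-\tilde M_2) - \tfrac12(Y-\tilde M_1)^T(Y-\tilde M_1) + \tfrac12\ln|A^{-1}\Lambda A^{-T}| - \tfrac12\ln|A^{-1}A^{-T}|\Big]$$

$$= E_{\tilde p_1}\Big[\tfrac12\sum_{i=1}^{d}\Big(\frac{(y_i-\tilde M_{2i})^2}{\lambda_i} - (y_i-\tilde M_{1i})^2 + \ln\lambda_i\Big)\Big]$$

$$= \tfrac12\sum_{i=1}^{d}\Big(\frac{1}{\lambda_i}E_{\tilde p_1}\big[(y_i-\tilde M_{1i}) + (\tilde M_{1i}-\tilde M_{2i})\big]^2 - 1 + \ln\lambda_i\Big)$$

$$= \tfrac12\sum_{i=1}^{d}\Big(\frac{1}{\lambda_i} + \frac{1}{\lambda_i}(\tilde M_{1i}-\tilde M_{2i})^2 - 1 + \ln\lambda_i\Big) \ge 0$$

since $0 < \lambda_i < \infty$ and $z - 1 \ge \ln z$ for all $z > 0$
(applied with $z = 1/\lambda_i$).

Similarly

$$E_{p_2}[G(X)] = E_{\tilde p_2}[G(X(Y))]$$

$$= \tfrac12\sum_{i=1}^{d}\Big(\frac{1}{\lambda_i}E_{\tilde p_2}(y_i-\tilde M_{2i})^2 - E_{\tilde p_2}\big[(y_i-\tilde M_{2i}) + (\tilde M_{2i}-\tilde M_{1i})\big]^2 + \ln\lambda_i\Big)$$

$$= \tfrac12\sum_{i=1}^{d}\big(1 - \lambda_i - (\tilde M_{2i}-\tilde M_{1i})^2 + \ln\lambda_i\big)$$

$$\le \tfrac12\sum_{i=1}^{d}\big(1 - \lambda_i + \ln\lambda_i\big) \le 0.$$

This completes the proof of the theorem.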


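From the above,

$$E_{p_1}(G(X)) - E_{p_2}(G(X)) = \tfrac12\sum_{i=1}^{d}\Big[\Big(1+\frac{1}{\lambda_i}\Big)(\tilde M_{1i}-\tilde M_{2i})^2 + \lambda_i + \frac{1}{\lambda_i} - 2\Big]$$

which is the divergence when $p_1$ and $p_2$ are normal densities. This
quantity is always positive when the first two moments of $(X|\omega_1)$ and
$(X|\omega_2)$ are not identical. This follows since $\frac{1}{z} + z > 2$
for all $z > 0$, $z \ne 1$.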
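The quantities in Theorem 1 depend only on first and second moments, so they can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names `simultaneous_diagonalization` and `expected_G` are hypothetical, and the transformation A is built by whitening $\Sigma_1$ and then diagonalizing the transformed $\Sigma_2$, which is one standard way to obtain $A\Sigma_1A^T = I$ and $A\Sigma_2A^T = \mathrm{diag}(\lambda_i)$.

```python
import numpy as np

def simultaneous_diagonalization(S1, S2):
    """Return (A, lam) with A S1 A^T = I and A S2 A^T = diag(lam)."""
    w, U = np.linalg.eigh(S1)
    W = U / np.sqrt(w)                      # W^T S1 W = I (whitening of S1)
    lam, V = np.linalg.eigh(W.T @ S2 @ W)   # diagonalize the whitened S2
    A = (W @ V).T
    return A, lam

def expected_G(M1, S1, M2, S2):
    """Expectations of G under class 1 and class 2 from the moment formulas."""
    A, lam = simultaneous_diagonalization(S1, S2)
    delta = A @ (M1 - M2)                   # transformed mean difference
    e1 = 0.5 * np.sum(1.0 / lam + delta**2 / lam - 1.0 + np.log(lam))
    e2 = 0.5 * np.sum(1.0 - lam - delta**2 + np.log(lam))
    return e1, e2, lam, delta
```

For any choice of means and positive definite covariances, `e1` should come out nonnegative and `e2` nonpositive, and their difference should match the divergence expression in $\lambda_i$ and the transformed mean differences.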

We now discuss the adjustment of the constant term in G(X). This is
equivalent to finding c which minimizes the error of the decision rule
$G - c \gtrless t$. This error function may be written

$$E(c) = P(\omega_2)\int_{G-c>t} dp_2 + P(\omega_1)\int_{G-c<t} dp_1.$$

If $p_1$ and $p_2$ are normal densities, then c = 0 is optimal. For the
general case consider the graphs of $P(\omega_1)\gamma_1$ and
$P(\omega_2)\gamma_2$, where $\gamma_1$ and $\gamma_2$ are the density
functions of the random variables $G(X/\omega_1)$ and $G(X/\omega_2)$
respectively.

E(c) will then be the area of the shaded region. This is minimum for
c = a - t where a is a solution of the equation
$P(\omega_2)\gamma_2(a) = P(\omega_1)\gamma_1(a)$. Again in the normal case
a = t by the optimality of the decision rule $G \gtrless t$. This is however
not always true in the general case as will be demonstrated below.

We now estimate a. Since both $G(X/\omega_1)$ and $G(X/\omega_2)$ can be
expressed as sums of d random variables, the central limit theorem implies in
many cases that their densities $\gamma_1$ and $\gamma_2$ approach normal
densities for large d. This occurs for instance when the $y_i$ in the proof
of the above theorem are independent under the hypotheses $\omega_1$ and
$\omega_2$. We then estimate a by a solution of

$$\frac{[a - E(G(X/\omega_1))]^2}{2\,\mathrm{Var}(G(X/\omega_1))} - \frac{[a - E(G(X/\omega_2))]^2}{2\,\mathrm{Var}(G(X/\omega_2))} = \ln\left[\frac{P(\omega_1)}{P(\omega_2)}\left(\frac{\mathrm{Var}(G(X/\omega_2))}{\mathrm{Var}(G(X/\omega_1))}\right)^{1/2}\right]$$

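The calculations are as follows:

$$\mathrm{Var}[G(X/\omega_1)] = E_{\tilde p_1}\Big[\tfrac12\sum_{i=1}^{d}\Big((y_i-\tilde M_{1i})^2\Big(\frac{1}{\lambda_i}-1\Big) + \frac{2}{\lambda_i}(y_i-\tilde M_{1i})(\tilde M_{1i}-\tilde M_{2i}) - \Big(\frac{1}{\lambda_i}-1\Big)\Big)\Big]^2$$

$$= \tfrac14\sum_{i,j=1}^{d}\Big(\frac{1}{\lambda_i}-1\Big)\Big(\frac{1}{\lambda_j}-1\Big)E_{\tilde p_1}\big[(y_i-\tilde M_{1i})^2(y_j-\tilde M_{1j})^2 - 1\big]$$

$$+ \sum_{i,j=1}^{d}\Big(\frac{1}{\lambda_i}-1\Big)\frac{(\tilde M_{1j}-\tilde M_{2j})}{\lambda_j}E_{\tilde p_1}\big[(y_i-\tilde M_{1i})^2(y_j-\tilde M_{1j})\big] + \sum_{i=1}^{d}\frac{(\tilde M_{1i}-\tilde M_{2i})^2}{\lambda_i^2}$$

$$\mathrm{Var}[G(X/\omega_2)] = E_{\tilde p_2}\Big[\tfrac12\sum_{i=1}^{d}\Big((y_i-\tilde M_{2i})^2\Big(\frac{1}{\lambda_i}-1\Big) - 2(y_i-\tilde M_{2i})(\tilde M_{2i}-\tilde M_{1i}) - (1-\lambda_i)\Big)\Big]^2$$

$$= \tfrac14\sum_{i,j=1}^{d}\Big(\frac{1}{\lambda_i}-1\Big)\Big(\frac{1}{\lambda_j}-1\Big)E_{\tilde p_2}\big[(y_i-\tilde M_{2i})^2(y_j-\tilde M_{2j})^2 - \lambda_i\lambda_j\big]$$

$$+ \sum_{i,j=1}^{d}\Big(\frac{1}{\lambda_i}-1\Big)(\tilde M_{1j}-\tilde M_{2j})E_{\tilde p_2}\big[(y_i-\tilde M_{2i})^2(y_j-\tilde M_{2j})\big] + \sum_{i=1}^{d}\lambda_i(\tilde M_{1i}-\tilde M_{2i})^2$$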

The above third and fourth moments may be easily estimated from sample data
by using the transformation A of the theorem. The solution we then choose for
a is that solution for which $P(\omega_1)\gamma_1$ is increasing more rapidly
than $P(\omega_2)\gamma_2$, i.e., for which

$$\frac{E(G(X/\omega_1)) - a}{\mathrm{Var}(G(X/\omega_1))} > \frac{E(G(X/\omega_2)) - a}{\mathrm{Var}(G(X/\omega_2))}.$$
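Under the normal approximation, the crossing point a solves a quadratic in a, and the selection rule picks one of its roots. The following sketch makes that concrete; it is an illustration only, with a hypothetical function name `estimate_crossing`, where `m1, v1, m2, v2` stand for the estimated means and variances of G under each class and `p1, p2` for the prior probabilities.

```python
import math

def estimate_crossing(m1, v1, m2, v2, p1, p2):
    """Solve p1*N(a; m1, v1) = p2*N(a; m2, v2) for a.

    Returns the root at which p1*N(.; m1, v1) rises faster than
    p2*N(.; m2, v2), or None if no such root exists.
    """
    # (a-m1)^2/(2 v1) - (a-m2)^2/(2 v2) = ln[(p1/p2) * sqrt(v2/v1)]
    rhs = math.log(p1 / p2) + 0.5 * math.log(v2 / v1)
    A = 1.0 / (2.0 * v1) - 1.0 / (2.0 * v2)
    B = -m1 / v1 + m2 / v2
    C = m1 * m1 / (2.0 * v1) - m2 * m2 / (2.0 * v2) - rhs
    if abs(A) < 1e-12:                 # equal variances: equation is linear
        if abs(B) < 1e-12:
            return None
        roots = [-C / B]
    else:
        disc = B * B - 4.0 * A * C
        if disc < 0.0:
            return None                # the two weighted densities never cross
        r = math.sqrt(disc)
        roots = [(-B - r) / (2.0 * A), (-B + r) / (2.0 * A)]
    # Selection rule: (m1 - a)/v1 > (m2 - a)/v2.
    good = [a for a in roots if (m1 - a) / v1 > (m2 - a) / v2]
    return good[0] if good else None
```

The adjusted constant is then c = a - t. With equal variances and equal priors the function returns the midpoint of the two means, as expected.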

III.  THE MINIMAL VARIANCE SOLUTION

Def. 1: Let D be a class of probability densities $\mu(m,v)$ on the real line
parametrized continuously by their means m and variances v ($m \in R$,
$v \in V \subset R$). D is called translational if in addition

$$\mu(m_1,v)(x) = \mu(m_2,v)(x + m_2 - m_1) \quad \text{for all real } x.$$


Def. 2: Let D be a translational family. Then D is said to be of monotone
error if in addition to the above the following is satisfied: For each
$0 < \alpha < 1$ and any two members of D, $\mu(0,v_1)$ and $\mu(1,v_2)$, the
function

$$E_\alpha(v_1,v_2) = \min_c\left[\int_{-\infty}^{c}(1-\alpha)\,\mu(1,v_2)\,dx + \int_{c}^{+\infty}\alpha\,\mu(0,v_1)\,dx\right]$$

is differentiable in $v_1$ and $v_2$ and

$$\frac{\partial E_\alpha}{\partial v_1},\ \frac{\partial E_\alpha}{\partial v_2} > 0 \quad \text{for all } v_1, v_2 \in V.$$
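As a numerical illustration of the monotone error property, the error function can be evaluated for the family of normal densities. This is a sketch under stated assumptions: the normal family is used as the example, the minimum over the threshold c is approximated by a grid search on [-5, 7], and the names `normal_cdf` and `E_alpha` are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def E_alpha(alpha, v1, v2, step=1e-3):
    """Approximate min over c of
    (1 - alpha) * P(N(1, v2) < c) + alpha * P(N(0, v1) > c)
    by a grid search over c in [-5, 7]."""
    s1, s2 = math.sqrt(v1), math.sqrt(v2)
    best = 1.0
    n = int(round(12.0 / step))
    for k in range(n + 1):
        c = -5.0 + k * step
        err = (1.0 - alpha) * normal_cdf((c - 1.0) / s2) \
            + alpha * (1.0 - normal_cdf(c / s1))
        best = min(best, err)
    return best
```

Evaluating `E_alpha` at a few variance pairs shows the error growing as either variance grows, which is the behavior the definition requires.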


Theorem 2: Let $v_1(a)$, $v_2(a)$ be differentiable mappings from some
parameter space $A \subset R^n$ into the variance space V of a monotone error
family D. If a' is a local extremum of the function
$E_\alpha(v_1(a), v_2(a))$ then a' is a local extremum of
$\beta v_1(a) + (1-|\beta|)v_2(a)$ for some $-1 \le \beta \le 1$.

Proof: Let a' be a local extremum of $E_\alpha(v_1(a), v_2(a))$. Then

$$\nabla E_\alpha(v_1(a'),v_2(a')) = \frac{\partial E_\alpha}{\partial v_1}\Big|_{a'}\nabla v_1(a') + \frac{\partial E_\alpha}{\partial v_2}\Big|_{a'}\nabla v_2(a') = 0.$$

Then for some $\beta$, $-1 < \beta \le 1$,
$\nabla(\beta v_1 + (1-|\beta|)v_2) = 0$ at a'. If $E_\alpha(v_1,v_2)$ is
concave at $v_1(a')$, $v_2(a')$ and $\beta \ge 0$, it can be shown that a' is
a local minimum of $\beta v_1 + (1-\beta)v_2$.
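The passage from the vanishing gradient to the weight $\beta$ can be written out explicitly. The following is a sketch of that normalization step, assuming $\partial E_\alpha/\partial v_2 > 0$ at a'; the symbols $e_1$, $e_2$ and $s$ are introduced here for illustration only and do not appear in the original.

```latex
\begin{align*}
e_1 &= \frac{\partial E_\alpha}{\partial v_1}\Big|_{a'}, \qquad
e_2 = \frac{\partial E_\alpha}{\partial v_2}\Big|_{a'} > 0, \qquad
s = |e_1| + e_2 > 0, \\
\beta &= \frac{e_1}{s}
\quad\Longrightarrow\quad
1 - |\beta| = \frac{s - |e_1|}{s} = \frac{e_2}{s}, \\
\nabla\big(\beta v_1 + (1-|\beta|)\,v_2\big)\Big|_{a'}
 &= \frac{1}{s}\big(e_1 \nabla v_1(a') + e_2 \nabla v_2(a')\big)
  = \frac{1}{s}\,\nabla E_\alpha\big(v_1(a'), v_2(a')\big) = 0 .
\end{align*}
```

Since $|e_1| < s$, this choice gives $-1 < \beta < 1$, and $\beta \ge 0$ exactly when $e_1 \ge 0$.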


Hence to minimize $E_\alpha(v_1(a), v_2(a))$ we need only consider functions
of the form $\beta v_1(a) + (1-|\beta|)v_2(a)$ for various values of $\beta$.
These may be much easier to handle numerically.

We now apply these results to optimal discrimination. Suppose we consider a
certain class of discriminant functions L(a). We may further restrict this
class in such a way that $E(L(a)/\omega_2) - E(L(a)/\omega_1) = 1$. If the
original parameter space A is rich enough to include the Gaussian
discriminant and is homogeneous and translation invariant this may be easily
achieved provided that the first and second moments are not identical under
both hypotheses (see Theorem 1). If for the resulting set of the parameters
the distribution functions of $(L/\omega_1)$ and of $(L/\omega_2)$ behave
(asymptotically) as those of some family D of monotone error, we may
determine that discriminant which (asymptotically) minimizes
$E_{P(\omega_1)}(L/\omega_1, L/\omega_2)$ by considering the extrema of
$\beta\,\mathrm{Var}(L(a)/\omega_1) + (1-|\beta|)\,\mathrm{Var}(L(a)/\omega_2)$
for various values of $\beta$, $-1 \le \beta \le 1$, and either

a) choosing that extremum which gives minimum error as calculated from
knowledge of the family D, or

b) when D is not known, choosing that extremum for which the discriminant
performs best on sample data.

One may at this point ask why we do not simply choose the L(a) which best
separates the sample data. The reason in many practical situations is again
that sample sizes are not exponential in d. If the parameter space A has
dimension greater than or equal to d-1 (as is the case for the class of
linear discriminants), it is highly likely that the "small" samples will be
well separated by some L(a). However this L(a) will not necessarily perform
well on new data since the parameter a is not reliably estimated. Our
procedure yields a one parameter family of discriminants determined from the
(reliably estimated) moments. The sample size need not be exponential in d to
reliably estimate the (single) parameter $\beta$.

In the following sections we will describe several applications of the above
procedures and indicate which limit theorems in probability guarantee the
existence of the monotone error families D. Even if such limit theorems do
not apply, the procedure b) is indeed a second order solution to the problem
of finding the optimal discriminant L(a), in that an approximate solution is
derived in terms of the means and variances of L(a) under $\omega_1$ and
$\omega_2$. We call this method the minimal variance solution.

IV.  THE QUADRATIC DISCRIMINANT FUNCTION

By an affine transformation we may assume that $x_1, x_2, \dots, x_d$ are
uncorrelated under both hypotheses and further that:

$$E(x_i/\omega_1) = 0; \quad E(x_i/\omega_2) = 1$$

$$\mathrm{Var}(x_i/\omega_1) = \lambda_i^{(1)}; \quad \mathrm{Var}(x_i/\omega_2) = \lambda_i^{(2)}$$

where the superscript on $\lambda_i$ indexes the class.

Let $Q(X) = \sum_{i \le j} a_{ij}x_ix_j + \sum_i b_ix_i$. In many cases Q is
normal or $(\chi^2, n)$ for large d. The class of normals and the class of
$(\chi^2, n)$ are classes which are characterized by their means and
variances. It is easy to see that such classes are of monotone error. Hence
the following minimal variance solution may indeed be (asymptotically)
optimal:

$$E(Q/\omega_1) = \sum_i a_{ii}\lambda_i^{(1)}$$

$$E(Q/\omega_2) = \sum_{i<j} a_{ij} + \sum_i a_{ii}(1+\lambda_i^{(2)}) + \sum_i b_i$$

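We want to find extrema of
$\beta\,\mathrm{Var}(Q/\omega_1) + (1-|\beta|)\,\mathrm{Var}(Q/\omega_2)$
subject to
$\sum_{i<j} a_{ij} + \sum_i b_i + \sum_i a_{ii}(1+\lambda_i^{(2)}-\lambda_i^{(1)}) = 1$.
Using a Lagrange multiplier $\phi$, we determine extrema of

$$\beta E_1\Big(\sum_{i<j} a_{ij}x_ix_j + \sum_i a_{ii}(x_i^2-\lambda_i^{(1)}) + \sum_i b_ix_i\Big)^2$$

$$+ (1-|\beta|)E_2\Big(\sum_{i<j} a_{ij}(x_ix_j-1) + \sum_i a_{ii}(x_i^2-1-\lambda_i^{(2)}) + \sum_i b_i(x_i-1)\Big)^2$$

$$- \phi\Big(\sum_{i<j} a_{ij} + \sum_i b_i + \sum_i a_{ii}(1+\lambda_i^{(2)}-\lambda_i^{(1)}) - 1\Big)$$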
i<  j J 

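Differentiating with respect to $a_{ij}$ and $b_i$ and setting equal to zero
we obtain:

$$2\sum_{l<k} a_{lk}\big[\beta E_1(x_lx_kx_ix_j) + (1-|\beta|)E_2(x_lx_kx_ix_j - 1)\big]$$
$$+ 2\sum_l b_l\big[\beta E_1(x_lx_ix_j) + (1-|\beta|)E_2(x_lx_ix_j - 1)\big]$$
$$+ 2\sum_l a_{ll}\big[\beta E_1(x_l^2x_ix_j) + (1-|\beta|)E_2(x_l^2x_ix_j - 1 - \lambda_l^{(2)})\big] = \phi \quad \text{for } i < j,$$

$$2\sum_{l<k} a_{lk}\big[\beta E_1(x_lx_kx_i^2) + (1-|\beta|)E_2(x_lx_kx_i^2 - 1 - \lambda_i^{(2)})\big]$$
$$+ 2\sum_l b_l\big[\beta E_1(x_lx_i^2) + (1-|\beta|)E_2(x_lx_i^2 - 1 - \lambda_i^{(2)})\big]$$
$$+ 2\sum_l a_{ll}\big[\beta\big(E_1(x_l^2x_i^2) - \lambda_l^{(1)}\lambda_i^{(1)}\big) + (1-|\beta|)E_2\big(x_l^2x_i^2 - (1+\lambda_l^{(2)})(1+\lambda_i^{(2)})\big)\big] = (1+\lambda_i^{(2)}-\lambda_i^{(1)})\,\phi,$$

and

$$2\sum_{l<k} a_{lk}\big[\beta E_1(x_lx_kx_i) + (1-|\beta|)E_2(x_lx_kx_i - 1)\big]$$
$$+ 2\sum_l a_{ll}\big[\beta E_1(x_l^2x_i) + (1-|\beta|)E_2(x_l^2x_i - 1 - \lambda_l^{(2)})\big]$$
$$+ 2b_i\big[\beta\lambda_i^{(1)} + (1-|\beta|)\lambda_i^{(2)}\big] = \phi.$$

In matrix form $M\vec a = \phi\,\vec e$, where $\vec e$ is the vector of
right-hand-side coefficients, yields
$\vec a = \phi M^{-1}\vec e = \phi\,\vec c$. From the constraint,

$$\phi = \Big(\sum_{i<j} c_{ij} + \sum_i c_i + \sum_i c_{ii}(1+\lambda_i^{(2)}-\lambda_i^{(1)})\Big)^{-1}$$

where $c_{ij}$, $c_i$, and $c_{ii}$ are the components of $\vec c$
corresponding to $a_{ij}$, $b_i$, and $a_{ii}$ of $\vec a$.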
11  1 11 


Unfortunately the above solution requires an inversion of a matrix for each
value of $\beta$. Also the asymptotic class may be unknown. The principal
axes discriminant $P = \sum_j a_jx_j^2 + \sum_j b_jx_j$ has the advantages of
being more numerically feasible (d equations in d unknowns) and behaving
normally (large d, independent $x_i$) in many cases. Its minimal variance
solution is as follows:

$$2\sum_l b_l\big[\beta E_1(x_lx_i^2) + (1-|\beta|)E_2(x_lx_i^2 - 1 - \lambda_i^{(2)})\big]$$
$$+ 2\sum_l a_l\big[\beta\big(E_1(x_l^2x_i^2) - \lambda_l^{(1)}\lambda_i^{(1)}\big) + (1-|\beta|)E_2\big(x_l^2x_i^2 - (1+\lambda_l^{(2)})(1+\lambda_i^{(2)})\big)\big] = (1+\lambda_i^{(2)}-\lambda_i^{(1)})\,\phi$$

$$2\sum_l a_l\big[\beta E_1(x_l^2x_i) + (1-|\beta|)E_2(x_l^2x_i - 1 - \lambda_l^{(2)})\big] + 2b_i\big[\beta\lambda_i^{(1)} + (1-|\beta|)\lambda_i^{(2)}\big] = \phi$$

The above system can be reduced to d equations in the d unknowns $a_i$ and
then solved for each value of $\beta$. The solution $(a_i, b_i, \phi)$ for
the constrained system is then obtained in a straightforward manner. Normal
tables may be used to estimate the error for each $\beta$.
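The principal axes solution can be sketched from sample data. Because P is linear in the features $(x_i^2, x_i)$, its variance under each class is a quadratic form in the coefficients, so the stationarity equations amount to one linear system in the feature covariance matrices. The code below is an illustration, not the author's procedure: it solves the full 2d-dimensional system directly rather than reducing it to d equations, it uses sample moments in place of the true ones, and the name `principal_axes_solution` is hypothetical.

```python
import numpy as np

def principal_axes_solution(X1, X2, beta):
    """Minimal variance coefficients of P = sum a_i x_i^2 + sum b_i x_i.

    X1, X2: arrays of shape (n, d) with samples of class 1 and class 2.
    beta:   weight in [-1, 1] on the class-1 variance.
    Returns (a, b, phi) with mean(P | class 2) - mean(P | class 1) = 1.
    """
    Z1 = np.hstack([X1**2, X1])            # features (x_i^2, x_i), class 1
    Z2 = np.hstack([X2**2, X2])            # same features, class 2
    C1 = np.cov(Z1, rowvar=False)
    C2 = np.cov(Z2, rowvar=False)
    M = 2.0 * (beta * C1 + (1.0 - abs(beta)) * C2)
    r = Z2.mean(axis=0) - Z1.mean(axis=0)  # gradient of the mean constraint
    c = np.linalg.solve(M, r)
    phi = 1.0 / (c @ r)                    # enforces the unit mean separation
    w = phi * c
    d = X1.shape[1]
    return w[:d], w[d:], phi
```

Calling the function over a grid of `beta` values gives the one parameter family of candidates; the error of each candidate can then be estimated from the means and variances of P under the normal approximation, or directly on sample data.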

V.  GENERAL DISCRIMINANT FUNCTIONS

We may use the procedure in III to determine higher order discriminants.
However this may be numerically impractical even for discriminants of
moderate order. Also for large dimensions d and relatively small sample sizes
only marginal densities and their correlations may be reliably estimated.
Hence we propose a method which depends only on marginal discriminants and
their correlations. In two asymptotic situations upper bounds on the error
rate are obtained.

Let us assume that $x_1, x_2, \dots$ are uniformly bounded but relax any
other previous assumption. (However in many practical cases "uncorrelating"
the $x_i$ may increase the applicability of a limit theorem.) Let $f_i(x_i)$
be a discriminant function for the one-dimensional random variable $x_i$. For
simplicity let us assume that, for all discriminants $f_i(x_i)$ henceforth
considered, $E(f_i/\omega_1) \ne E(f_i/\omega_2)$.(2) One choice of $f_i$ is
(an estimate of) the log-likelihood ratio of the marginal densities of $x_i$,
$\ln p_2(x_i) - \ln p_1(x_i)$.

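Consider the discriminant function

$$D(X) = \sum_{i=1}^{d} a_if_i(x_i).$$

To obtain the minimal variance solution we find extrema of

$$\beta E_1\Big(\sum_i a_if_i(x_i) - \sum_i a_iE_1(f_i(x_i))\Big)^2 + (1-|\beta|)E_2\Big(\sum_i a_if_i(x_i) - \sum_i a_iE_2(f_i(x_i))\Big)^2$$

$$- \phi\Big(\sum_i a_iE_2(f_i(x_i)) - \sum_i a_iE_1(f_i(x_i)) - 1\Big).$$

These are given by

$$\vec a = \phi M^{-1}\vec b = \phi\,\vec c$$

where

$$b_i = E_2(f_i) - E_1(f_i)$$

$$M_{ij} = 2\beta\big[E_1(f_if_j) - E_1(f_i)E_1(f_j)\big] + 2(1-|\beta|)\big[E_2(f_if_j) - E_2(f_i)E_2(f_j)\big]$$

and

$$\phi = \Big[\sum_i c_i\big(E_2(f_i) - E_1(f_i)\big)\Big]^{-1}.$$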
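The closed form for the coefficients follows from one differentiation. The short derivation below spells out that step; the symbols $C_1$, $C_2$ (the covariance matrices of the $f_i$ under each class) and $J$ (the Lagrangian) are introduced here for illustration and are not named in the original.

```latex
\begin{align*}
J(\vec a,\phi) &= \beta\,\vec a^{\,T} C_1 \vec a + (1-|\beta|)\,\vec a^{\,T} C_2 \vec a
                 - \phi\,(\vec a^{\,T}\vec b - 1),
\qquad (C_k)_{ij} = E_k(f_i f_j) - E_k(f_i)E_k(f_j), \\
\frac{\partial J}{\partial a_i} &= \sum_j \big[2\beta (C_1)_{ij} + 2(1-|\beta|)(C_2)_{ij}\big] a_j - \phi\, b_i
 = \sum_j M_{ij} a_j - \phi\, b_i = 0
 \;\Longrightarrow\; \vec a = \phi\, M^{-1}\vec b = \phi\,\vec c, \\
\vec a^{\,T}\vec b &= 1
 \;\Longrightarrow\; \phi\,\vec c^{\,T}\vec b = 1
 \;\Longrightarrow\; \phi = \Big[\sum_i c_i\big(E_2(f_i) - E_1(f_i)\big)\Big]^{-1}.
\end{align*}
```

Only the means of the $f_i$ and their pairwise second moments under each class enter, which is why the method needs nothing beyond marginal discriminants and their correlations.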

The following definition is adapted from [3].

Def. 3: An infinite sequence of random variables $x_1, x_2, \dots$ is said to
be finitely dependent if for every nonempty finite subset of the variables A
there exists another finite subset B(A) (including A) such that
$\{x_i \in A\}$ is independent of $\{x_i \in B^c\}$ and
$\sup_A \frac{|B(A)|}{|A|} < \infty$, where $|\cdot|$ denotes the cardinality
of a set.

(2) This is hardly a restriction since in practical cases estimates of the
expectations mentioned will rarely be the same.


If our sequence $x_1, x_2, \dots$ is finitely dependent then it follows that
the sequence $\{f_i(x_i)\}$ is also finitely dependent. By the results of [3]
the discriminant function $D(X) = \sum_{i=1}^{d} a_if_i(x_i)$ will be
normally distributed for large d provided that the $a_i$'s become
sufficiently small(3) as d becomes large. This has the following useful
consequence.

Theorem 3: Let $x_1, x_2, \dots$ be finitely dependent and suppose
$f_i(x_i)$ is the log-likelihood ratio. Suppose further that
$x_{i_1}, x_{i_2}, \dots$ are independent. If for large d the minimal
variance solution D(X) has coefficients which satisfy the above "smallness"
conditions, then the probability of error using D(X) will become bounded
above by the Bayes error for the sequence $x_{i_1}, \dots, x_{i_k}$
($i_k \le d$).


Outline of proof: The optimal discriminant function for
$x_{i_1}, x_{i_2}, \dots, x_{i_k}$ is
$D(X) = \frac{1}{k}\sum_{s=1}^{k} f_{i_s}(x_{i_s})$. As d becomes large k
becomes large and the above coefficients ($\frac{1}{k}$) will satisfy the
"smallness" condition. Since the minimal variance solution is unique it will
have minimum error among all those discriminants whose coefficients satisfy
the "smallness" condition (since this is a connected set in the coefficient
space).

(3) The exact conditions on the $a_i$ in terms of the variances of the
$f_i(x_i)$ can be determined from Corollary 4.2 on page 232 of [3].

In many practical situations the marginal densities are not available. We
introduce a method of minimal marginal moment variance (m.m.m.v.). Let x be a
one-dimensional random variable. Suppose $y_1, y_2, \dots$ is independent and
identically distributed (i.i.d.) with the same distribution as x under both
hypotheses. Consider a polynomial discriminant function

$$P_{d,Q}(Y) = \sum_{i=1}^{d}\sum_{j=1}^{Q} a_{ij}\,y_i^j.$$

The minimal variance solution will be of the form
$P_{d,Q}(Y) = \sum_{i=1}^{d}\sum_{j=1}^{Q} a_j\,y_i^j$ where the $a_j$ are
extrema of

$$\beta\,\mathrm{Var}\Big(\sum_{j=1}^{Q} a_jx^j\Big/\omega_1\Big) + (1-|\beta|)\,\mathrm{Var}\Big(\sum_{j=1}^{Q} a_jx^j\Big/\omega_2\Big) - \phi\Big(E_2\Big(\sum_{j=1}^{Q} a_jx^j\Big) - E_1\Big(\sum_{j=1}^{Q} a_jx^j\Big) - 1\Big).$$

Differentiating yields

$$\sum_{l=1}^{Q} 2a_l\Big[\beta\big(E_1(x^{l+j}) - E_1(x^l)E_1(x^j)\big) + (1-|\beta|)\big(E_2(x^{l+j}) - E_2(x^l)E_2(x^j)\big)\Big] = \phi\big(E_2(x^j) - E_1(x^j)\big).$$

Using normal tables the correct $\beta$ is determined and the corresponding
$a_j$ calculated from the variances of $P_{d,Q}$.
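A worked special case may help fix ideas. Taking Q = 1 (a linear marginal discriminant $a_1x$), the stationarity condition for j = 1 together with the unit mean separation constraint can be solved by hand. The block below is an illustrative calculation under that assumption and is not taken from the original.

```latex
\begin{align*}
2a_1\big[\beta\,\mathrm{Var}(x/\omega_1) + (1-|\beta|)\,\mathrm{Var}(x/\omega_2)\big]
  &= \phi\,\big(E_2(x) - E_1(x)\big), \\
a_1\big(E_2(x) - E_1(x)\big) &= 1
  \;\Longrightarrow\; a_1 = \frac{1}{E_2(x) - E_1(x)}, \\
\phi &= \frac{2\big[\beta\,\mathrm{Var}(x/\omega_1) + (1-|\beta|)\,\mathrm{Var}(x/\omega_2)\big]}
             {\big(E_2(x) - E_1(x)\big)^2}.
\end{align*}
```

For Q = 1 the coefficient $a_1$ is fixed by the constraint alone and does not depend on $\beta$; the weight $\beta$ only starts to shape the discriminant once Q is at least 2, when moments up to order 2Q enter the system.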

Def. 4: The m.m.m.v. discriminant function of x of degree (d,Q) is given by
$F_Q(x) = \sum_{j=1}^{Q} a_jx^j$ where the $a_j$ are the above minimal
variance coefficients.


Theorem 4: Let $x_1, x_2, \dots$ be finitely dependent and suppose
$f_i(x_i)$ is the m.m.m.v. discriminant function of $x_i$ of degree (k,Q).
Suppose further that $x_{i_1}, x_{i_2}, \dots$ are i.i.d. If for large d and
large Q the minimal variance solution D(X) has coefficients which satisfy the
previous "smallness" conditions, then the probability of error using D(X)
will become bounded above by the Bayes error for the sequence
$x_{i_1}, \dots, x_{i_k}$ ($i_k \le d$).


Outline of proof: The optimal discriminant function is
$D(X) = \frac{1}{k}\sum_{s=1}^{k} H(x_{i_s})$ where $H(x_{i_s})$ is the log
likelihood ratio. For large Q this may be well approximated by
$D_k(X) = \frac{1}{k}\sum_{s=1}^{k}\sum_{j=1}^{Q} g_j\,x_{i_s}^j$ where the
$g_j$ are determined by fitting $H(x_{i_s})$ to a polynomial of degree Q. But
for large k the minimal variance solution for $x_{i_1}, x_{i_2}, \dots$ is
$M(X) = \sum_{s=1}^{k}\sum_{j=1}^{Q} a_j\,x_{i_s}^j = \sum_{s=1}^{k} F_Q(x_{i_s})$.
It will satisfy the smallness condition and its error will approach that of
$D_k$, which in turn approaches that of D. Hence the error of M approaches
the Bayes error of $x_{i_1}, x_{i_2}, \dots, x_{i_k}$. By arguments similar
to the proof of Theorem 3, the error of D(X) will become bounded by the Bayes
error of $x_{i_1}, x_{i_2}, \dots, x_{i_k}$.


We conclude from these theorems that, even if we do not know which
subsequence $x_{i_1}, x_{i_2}, \dots$ of the process $x_1, x_2, \dots$ is an
independent sequence, we may discriminate the finitely dependent process
(asymptotically) as well as we could discriminate $x_{i_1}, x_{i_2}, \dots$
if the $\{i_j\}$ were known.
